Data Driven Prediction of NFL Player Salaries

Jonas da Silva

Senay Alemayehu

Andrew Mendez

Sean Sweeny

Introduction:

The National Football League (NFL) stands as a multi-billion dollar industry with a massive global fanbase. Fans invest significant amounts of time and resources to follow and support their favorite teams and players. Teams, in turn, strive to fulfill their duty to the fans by providing the best possible experience, whether through stadium renovations or building winning teams that captivate viewers. Each year, teams allocate approximately 225 million dollars to player salaries through contracts. On the surface, it seems straightforward - teams spend money on players they believe will benefit the team, and success follows. However, the reality is far more nuanced and complex.

Teams often find themselves making critical decisions that impact their financial investments. There are instances where players are overpaid, leading to adverse effects on team performance. Conversely, teams may pass on certain players who turn out to be valuable assets for other teams. Furthermore, determining the worth of each position becomes an intricate matter. Positions hold varying levels of importance within a team's strategy, thereby influencing their respective salaries. Other factors, such as age and past performance, must also be considered. However, even with a meticulous evaluation process, outcomes may still deviate from expectations due to factors like work ethic, fluctuating performance, or unpredictable anomalies.

With this in mind, we aim to address the following question: which factors can be reliably accounted for when determining player salaries? To answer this question, we will explore both basic box score statistics and the Pro Football Focus (PFF) grading system, which provides a comprehensive breakdown of every play by every player. This contextual analysis adds depth to traditional statistics, which can sometimes be misleading. For example, a quarterback may deliver a perfect pass that is dropped, and the box score records only an incomplete pass. Our analysis will encompass a wide range of statistics, both advanced and fundamental, to ascertain their correlation with player salaries and determine the extent of their impact. This will provide valuable insights into how past performance can be used to predict a player's worth and the value they bring to a team.

To achieve this, we will delve into various categories, such as passing, rushing, receiving, blocking, run defense, pass rush, and coverage. By individually examining each category, we can analyze different positions and distinguish the value associated with specific skills. Through this comprehensive exploration, we aim to shed light on the factors that best correlate with salary and improve our ability to predict a player's monetary value based on their performance history.

We begin by importing the pandas library as pd and the numpy library as np. We also import the warnings module to suppress any warning messages.

Next, we use the pd.read_html() function to read the HTML content from the specified URL. The flavor='html5lib' parameter ensures that the HTML is parsed correctly. The function returns a list of DataFrame objects, with each DataFrame representing a table from the HTML content.

In our case, we assume that the desired table is present on the webpage and will be accessible at the first index of the returned list (tables[0]). However, it's important to note that the HTML structure of the webpage might change over time, so it's always good to verify that the table is being extracted correctly.

The table contains various columns of data, including player salary information such as signing date, total value, average annual value (AAV), and details regarding guaranteed salary.
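A minimal sketch of this collection step, assuming the Spotrac page still exposes the contracts as the first table on the page. The helper names `load_contract_table` and `parse_money` are hypothetical, and the dollar-string format handled by `parse_money` is an assumption about how the site renders values.

```python
import warnings

import pandas as pd

warnings.filterwarnings("ignore")  # suppress warning messages

SPOTRAC_URL = "https://www.spotrac.com/nfl/contracts/sort-value/limit-2000/"


def load_contract_table(url: str = SPOTRAC_URL) -> pd.DataFrame:
    """Return the first HTML table on the page as a DataFrame."""
    # read_html returns one DataFrame per <table> element it finds.
    tables = pd.read_html(url, flavor="html5lib")
    return tables[0]


def parse_money(value: str) -> float:
    """Turn a string such as '$45,000,000' into 45000000.0 (assumed format)."""
    return float(str(value).replace("$", "").replace(",", "").strip())
```

Calling `load_contract_table()` needs network access plus the html5lib and BeautifulSoup packages. After loading, it is worth checking the column names, since the page layout may change over time.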

Once we have successfully obtained the salary data in a DataFrame, we can proceed with further exploratory data analysis and hypothesis testing to gain insights into the relationship between salaries and player performance.

First we must collect data. To do this, we used the Python library pandas to read the HTML. Since the website we are collecting from, https://www.spotrac.com/nfl/contracts/sort-value/limit-2000/, stores the data we're looking for in a table tag, we can easily grab it and store it in a DataFrame. This website contains salary information for each player, including when they signed, total value, average annual value (AAV), and information relating to guaranteed salary.

Now we need to get the player stats. To do this, we extracted CSV files from https://www.pff.com/, which contain many useful statistics for each position. Again we store these in pandas DataFrames to be consistent, since we will use all the datasets together. We also need to clean the data. We do this by melting the data, removing any seasons in which a player played very few snaps, and keeping only the positions we want. We keep the seasons separate because recency matters: a player having a good season 5 years ago is not nearly the same as them playing well in the last season.
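The cleaning step could look roughly like the sketch below. It uses `pandas.wide_to_long`, a convenience function for melting several stat columns at once. The column layout (`snaps_2021`, `yards_2022`, ...), the helper name `to_long_by_season`, and the 100-snap threshold are all assumptions made for illustration, not the actual PFF export format.

```python
import pandas as pd

# Hypothetical wide export: one row per player, one column per stat and season.
wide = pd.DataFrame({
    "player": ["A", "B", "C"],
    "position": ["QB", "QB", "HB"],
    "snaps_2021": [900, 12, 500],
    "snaps_2022": [850, 400, 30],
    "yards_2021": [4000, 50, 800],
    "yards_2022": [3800, 1500, 40],
})


def to_long_by_season(df, stats, positions, min_snaps):
    """Melt to one row per player-season, then filter by position and snaps."""
    long = pd.wide_to_long(
        df, stubnames=stats, i=["player", "position"], j="year", sep="_"
    ).reset_index()
    keep = long["position"].isin(positions) & (long["snaps"] >= min_snaps)
    return long[keep].reset_index(drop=True)


# min_snaps=100 is an illustrative cutoff, not the value used in the analysis.
qb_seasons = to_long_by_season(wide, ["snaps", "yards"], ["QB"], min_snaps=100)
```

Each remaining row is one quarterback season, so recent and older seasons stay distinguishable through the `year` column.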

Here are the rushing stats. We will limit these to just running backs.

Here are the receiving stats. These are a little more complicated, since they contain the primary statistics for both tight ends and wide receivers. We will use the data from these files for two separate dataframes, one for the wide receivers and one for the tight ends.

Here are the blocking stats. These contain the primary stats for offensive linemen, who are typically broken down into 3 positions: center, guard and tackle. We will create three separate dataframes for this reason.

On the defensive side we have to do different things with our dataframes. Instead of one category containing information for multiple positions in this case our positions contain information we want from multiple categories. For the interior defensive line and edge rushers, we want stats from our run defense and pass rushing dataframes. For linebackers, we want coverage and run defense statistics. To do this we will merge on positions and player names so that we can split our data up accordingly.
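As a rough illustration of that merge, the sketch below joins a run defense table and a pass rush table on player name and position, then splits the result by position. The column names (`stops`, `pressures`) and the tiny tables are made up for the example. If the tables also carry a season column, it would need to be added to the join keys.

```python
import pandas as pd

# Hypothetical slices of the run defense and pass rush tables.
run_defense = pd.DataFrame({
    "player": ["D1", "D2", "L1"],
    "position": ["DI", "ED", "LB"],
    "stops": [30, 22, 45],
})
pass_rush = pd.DataFrame({
    "player": ["D1", "D2", "C1"],
    "position": ["DI", "ED", "CB"],
    "pressures": [40, 65, 3],
})

# An inner join keeps only players that appear in both categories.
merged = run_defense.merge(pass_rush, on=["player", "position"], how="inner")

# Split the merged table by position.
interior = merged[merged["position"] == "DI"]
edge = merged[merged["position"] == "ED"]
```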

For cornerbacks and safeties, we only need to worry about coverage, so we will split it the same way we did for offensive players.

Now we need to explore our data and understand it. This is where we analyze the features of our dataset and start to determine which ones are more valuable for our hypothesis. To do this we will use singular value decomposition (SVD). We will import svds from scipy.sparse.linalg and apply it to our data. We first need to remove any irrelevant columns from our datasets, such as player_id, franchise_id and team, and handle any missing data. We replaced each missing value with the mean of its column. Since almost all missing data has already been dropped, this will not affect the analysis much.
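A small sketch of this preparation and decomposition, assuming a toy table whose stat columns (`yards`, `dropbacks`, `touchdowns`, `grades_pass`) stand in for the real feature set. The SciPy documentation does not guarantee the order in which `svds` returns singular values, so the sketch sorts them explicitly in ascending order.

```python
import numpy as np
import pandas as pd
from scipy.sparse.linalg import svds

# Hypothetical quarterback table with identifier columns and one missing value.
stats = pd.DataFrame({
    "player_id": [1, 2, 3, 4, 5, 6],
    "franchise_id": [10, 11, 12, 13, 14, 15],
    "team": ["A", "B", "C", "D", "E", "F"],
    "yards": [4100.0, 3900.0, np.nan, 2500.0, 4800.0, 3100.0],
    "dropbacks": [600.0, 580.0, 450.0, 410.0, 650.0, 500.0],
    "touchdowns": [30.0, 22.0, 18.0, 12.0, 38.0, 20.0],
    "grades_pass": [85.1, 70.4, 66.0, 58.3, 91.2, 74.9],
})

# Drop identifiers, then replace missing values with the column mean.
features = stats.drop(columns=["player_id", "franchise_id", "team"])
features = features.fillna(features.mean())

X = features.to_numpy(dtype=float)
k = min(X.shape) - 1  # svds needs k strictly below the smaller dimension
U, s, Vt = svds(X, k=k)

# Sort so the largest singular value and its vector come last.
order = np.argsort(s)
U, s, Vt = U[:, order], s[order], Vt[order]
```

Each row of `Vt` has one entry per feature column, in the same order as `features.columns`.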

These values indicate how much correlation there is across the data. Each element in the vector corresponds to its respective column. If the value is closer to 0, that means there is high correlation between that feature and the rest of the dataset. Now we will list the features themselves in order from most correlated to least, to give us a good picture of the value of each feature.

We want to plot the singular values returned by svds (their squares are the eigenvalues of the data's Gram matrix) to get an idea of which vectors tell us the most about our data.

Now we want to sort the last vector in the Vt matrix in order to visualize the variability of our features. Larger values indicate stronger variability. We must also reorder the 3 vectors before it in the same way. In the plot of these values, the last 4 dots grow at a much faster rate than the rest. We want to visualize those 4 vectors while keeping the sorting consistent, so we reorder all four vectors and the feature names together so that each value stays matched to its feature.
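One way to keep the ordering consistent is to compute a single permutation from the last vector and apply it to the feature names and to all four vectors at once. The sketch below uses a made-up `Vt` and made-up feature names. It sorts by raw value, as described here. Sorting by absolute value would be a reasonable alternative.

```python
import numpy as np

feature_names = np.array(["yards", "dropbacks", "touchdowns", "grades_pass", "spikes"])

# Hypothetical stand-in for the last four rows of Vt (one column per feature).
Vt = np.array([
    [0.10, -0.70, 0.20, 0.65, 0.15],
    [0.55, 0.05, -0.60, 0.10, 0.56],
    [-0.30, 0.40, 0.50, 0.60, 0.37],
    [0.80, 0.45, 0.05, 0.10, 0.38],
])

top4 = Vt[-4:]
# One permutation, taken from the last vector, largest value first.
order = np.argsort(top4[-1])[::-1]

sorted_names = feature_names[order]
sorted_vectors = top4[:, order]  # every vector uses the same column order
```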

Now we will put everything together in one graph. We will show a scatter diagram from each of the 4 vectors. We will display the values for each feature individually. This will give us a good idea of the variability of the features so that we can understand which ones will be useful when trying to predict salary and which ones will tell us things we already know and don't need to be trained on at the risk of overfitting.

Now we want to explore some of these features individually. Let's start with yards, since our principal component analysis showed it to be the feature with the most variability. We want to see how it correlates with our targets.

It's fairly hard to tell what is going on but we do indeed see a positive correlation between the yards and the salary, with the players earning a higher salary generally throwing for more yards than those who aren't. The next notable feature we want to see is dropbacks.

It looks surprisingly similar. Let's look at PFF grade, which again is a grade assigned by analysts at Pro Football Focus, who grade every play by the player.

What is very noticeable across all of these graphs is the two clusters that appear in the plots. These clusters are separated by salary. The lower cluster represents players on rookie contracts, while the upper cluster contains players who have been paid second contracts, which are significantly higher. When a player is first drafted, they are essentially assigned a contract that they play on for 4-5 years before being eligible for a new one. Typically, rookie contracts are fairly cheap, and if the player is good enough they either get extended or sign with another team, usually for a lot more money than they were making before. This is a very important distinction in our data. We will explore it further with a k-means analysis, which will give us a good idea of the split between our clusters.
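A minimal k-means sketch using scikit-learn's `KMeans`, assuming we cluster on salary alone. The AAV figures are invented, and clustering on a single column is a simplification for illustration. The analysis itself may cluster on salary together with a performance statistic.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical AAV values in millions of dollars.
aav_millions = np.array([0.9, 1.2, 2.5, 4.0, 8.7, 25.0, 33.5, 40.0, 45.0, 50.3])

model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(aav_millions.reshape(-1, 1))

# Cluster ids are arbitrary, so find the one with the higher center.
centers = model.cluster_centers_.ravel()
upper = int(np.argmax(centers))
is_second_contract = labels == upper
```

The boolean mask `is_second_contract` can then be used to split any per-player table into the lower and upper clusters.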

We get a very clear and obvious cluster, and the data within each cluster is vastly different with the yellow cluster being more condensed and the purple cluster being a lot more spread out.

The violin plot shows that players earning above 20 million have much higher yardage than players earning under 20 million. This could be explained by playing time: big-name players get paid more and get more opportunities to gain yards than lower-paid players.

This gives us a better picture of how our clusters are distributed. In our left cluster we have more datapoints concentrated in one area, at around 250 yards per game. Our other cluster is slightly more spread out, with the bulk of datapoints lying around 200 yards per game. We can clearly see there is a correlation between getting paid more and throwing for more yards. But does this mean that correlations within each cluster will also be informative? Let's explore further. First, let's average each quarterback's stats across seasons, rather than keeping them separate by year, in order to get a more general picture.
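Averaging across seasons can be done with a groupby on the player. In the sketch below the column names and numbers are placeholders for the cleaned quarterback table.

```python
import pandas as pd

# Hypothetical per-season quarterback rows.
qb_seasons = pd.DataFrame({
    "player": ["A", "A", "B", "B", "B"],
    "year": [2021, 2022, 2020, 2021, 2022],
    "yards": [4000.0, 4400.0, 3000.0, 3300.0, 3600.0],
    "grades_pass": [80.0, 84.0, 65.0, 70.0, 72.0],
    "aav": [45e6, 45e6, 5e6, 5e6, 5e6],
})

# One row per player: the mean of every stat over that player's seasons.
qb_avg = qb_seasons.drop(columns="year").groupby("player", as_index=False).mean()
```

Because AAV is constant for a player within one contract, averaging leaves it unchanged here.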

Now we want to see the relationship between a player's average yards and their salary.

Now let's analyze our clusters. Let's start with the less than 20 million cluster.

Next, the more than 20 million cluster.

Now let's look at the relationship between players before they receive their contract.

Now let's look at how players perform after they receive their contracts.

Now let's explore the upper cluster, the quarterbacks who have received a notable second contract. Understanding these players will help us use our data to figure out how to better predict what contracts should look like.

There appears to be high correlation between salary and yards for players who are paid less. However, for highly paid players the correlation is negligible, which means yards is not a good predictor of future salary for them. We do see slightly more promise with other features, such as big time throws and PFF grade.

We will now explore feature selection further, to see whether we can select a good number of features that together are able to predict salary. We want to identify any features that are redundant with each other or irrelevant to predicting salary. Understanding our data in this way allows for cleaner models that converge more quickly, generalize better and produce more accurate results more efficiently. We will start by seeing how our features correlate with AAV and drop the features that show no correlation.

The cutoff we will use is spikes, which, based on intuition about football, has very little to do with analyzing a player's performance. Everything with less correlation to AAV than spikes will also be dropped.
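The cutoff rule can be sketched as follows, assuming a table with an `aav` column. The numbers and the `penalties` column are invented for the example, and comparing absolute correlations is an assumption about how "less correlation" is measured.

```python
import pandas as pd

# Hypothetical averaged quarterback table (AAV in millions).
qb = pd.DataFrame({
    "big_time_throws": [5, 8, 12, 25, 30, 40],
    "grades_pass": [60, 62, 70, 80, 85, 90],
    "spikes": [1, 0, 2, 1, 2, 1],
    "penalties": [1, 2, 2, 1, 1, 2],
    "aav": [2, 5, 10, 25, 35, 45],
})

# Absolute Pearson correlation of every feature with AAV, strongest first.
corr_with_aav = (
    qb.drop(columns="aav").corrwith(qb["aav"]).abs().sort_values(ascending=False)
)

# Keep only features that correlate more strongly than spikes does.
cutoff = corr_with_aav["spikes"]
kept = corr_with_aav[corr_with_aav > cutoff].index.tolist()
```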

Now we have reduced our feature set a little. Next we want to remove features that are redundant with each other. To do this we will use the pandas scatter matrix plotting function to visualize this intuitively. It plots every feature against every other feature, so pairs of features that are highly correlated, and therefore redundant with each other, stand out.

From this graph we can identify a few features to drop. First off, aimed passes is a highly redundant feature, so we will drop it. Several other features behave similarly to each other, including completions, dropbacks, first downs, big time throws and yards. Because big time throws is the most correlated with AAV, we will keep that one and drop the others. Offense and passing grades are also highly redundant, since the passing grade is a component of the offensive grade, so we will drop one of the two. We can see some correlation in other areas, but we will leave those for now. This leaves us with 18 features.
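A numeric complement to reading the scatter matrix by eye is to list the feature pairs whose absolute correlation exceeds a threshold. This is a sketch of that alternative, not the method used in the analysis. The helper name `redundant_pairs`, the 0.95 threshold and the sample numbers are all assumptions.

```python
import numpy as np
import pandas as pd


def redundant_pairs(df, threshold=0.95):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is reported once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].index.tolist()


# Hypothetical feature table.
features = pd.DataFrame({
    "completions": [300, 350, 400, 250, 380],
    "dropbacks": [502, 578, 661, 419, 630],
    "spikes": [1, 3, 0, 2, 1],
})
pairs = redundant_pairs(features)
```

For each reported pair, one could then keep whichever feature correlates more strongly with AAV.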

Nothing seems blatantly correlated. Now we want to test how many features are valuable for our dataset. For this we will do a sequential feature selection and test how a linear regression model performs with each number of features. SequentialFeatureSelector from sklearn is a greedy algorithm that, at each step, adds the feature that most improves the model, until it reaches the specified number of features. We looped over every possible number of features to see whether the score converged early, so that we could possibly remove some features.
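The loop over feature counts might look like the sketch below, which runs scikit-learn's `SequentialFeatureSelector` around a `LinearRegression` on synthetic data. The data, the forward direction and the 5-fold cross-validation are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: only features 0 and 2 actually drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.05, size=80)

selected = {}
scores = {}
# The selector requires fewer features than the total, hence the upper bound.
for n in range(1, X.shape[1]):
    sfs = SequentialFeatureSelector(
        LinearRegression(), n_features_to_select=n, direction="forward", cv=5
    )
    sfs.fit(X, y)
    cols = np.flatnonzero(sfs.get_support())
    selected[n] = cols.tolist()
    # Mean cross-validated R^2 of a linear model on the chosen columns.
    scores[n] = cross_val_score(LinearRegression(), X[:, cols], y, cv=5).mean()
```

Plotting `scores` against the number of features shows where adding more features stops helping.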

There is a big jump from 15 to 16 features but there is not as big of one from 16 to 17, so we will move forward with the top 16 features.

Now we will use machine learning to create a model that can predict salary. We will use some of the observations from our exploratory analysis to experiment with multiple models. For the most part, we will be using neural networks, built with TensorFlow, a Python library for defining and training neural networks. It allows us to specify how many hidden layers and nodes we have. The finer details, such as the weight adjustments that determine how we get from input to output, are handled on the back end. For our model, we will use 1 output since our targets are already closely correlated. We will use the ReLU activation function, which maps negative values to zero, and we will normalize all our data. We will train on 80% of our data and hold out 20% for testing. We will have 2 hidden layers.
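The training setup can be sketched as follows. Note that this sketch swaps in scikit-learn's `MLPRegressor` as a lightweight stand-in for the TensorFlow model, so it illustrates the 80/20 split, the normalization, the two hidden layers and the ReLU activation rather than the actual network. The layer sizes, the synthetic data and the target are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder for 16 selected features and an AAV-like target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
y = 5.0 * X[:, 0] + 2.0 * X[:, 1] + 20.0

# Train on 80% of the data and hold out 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the normalization on the training split only, then reuse it for testing.
scaler = StandardScaler().fit(X_train)

# Two hidden layers with ReLU activations and a single output.
net = MLPRegressor(
    hidden_layer_sizes=(32, 16), activation="relu", max_iter=2000, random_state=0
)
net.fit(scaler.transform(X_train), y_train)
predictions = net.predict(scaler.transform(X_test))
```

Fitting the scaler on the training split alone keeps information from the held-out rows out of the model.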

Let's now train on how players perform before their contract.

Now let's train on our upper cluster before they receive their contracts.

Let's train on quarterbacks after they receive their contracts.

There simply isn't enough data to draw any conclusions here. From our neural network we found that both our upper cluster and players before their contract offer more accurate training data. The most accurate model was the one trained on the cluster of quarterbacks who received second contracts, using how they played before those contracts. This proved to be a fairly accurate model in terms of predicting salary on our testing data.

Now let's try a couple of other machine learning models, starting with K-nearest neighbors. This model takes a new data point, looks at the known points closest to it, and makes a prediction based on them. Sklearn provides an implementation that lets us specify the number of neighbors we want to consider.
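A tiny K-nearest neighbors sketch with scikit-learn's `KNeighborsRegressor`. The single feature and the salary values are invented, and `n_neighbors=3` is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical known players: one feature value and an AAV in millions.
X_known = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_known = np.array([1.0, 2.0, 3.0, 30.0, 35.0, 40.0])

knn = KNeighborsRegressor(n_neighbors=3)
knn.fit(X_known, y_known)

# Each prediction is the mean target of the 3 closest known players.
prediction = knn.predict(np.array([[1.0], [11.0]]))
```

Here the first query averages the three low-salary neighbors and the second averages the three high-salary ones.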

This is comparable to our neural net, so it is still a pretty good model in terms of what we're looking for. Now let's try linear regression.
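For comparison, a linear regression can be fitted and scored the same way. The sketch below uses scikit-learn's `LinearRegression` with mean absolute error as the metric. The data points and the choice of metric are assumptions, and the target is deliberately curved so that a straight line cannot fit it exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Hypothetical grade-like feature and AAV in millions, curved on purpose.
X = np.array([[50.0], [60.0], [70.0], [80.0], [90.0]])
y = np.array([5.0, 12.0, 20.0, 31.0, 44.0])

linreg = LinearRegression().fit(X, y)

# Average absolute gap between the fitted line and the known salaries.
mae = mean_absolute_error(y, linreg.predict(X))
```

Scoring every model with the same metric on the same held-out data makes the comparison between the neural network, KNN and linear regression meaningful.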

This is clearly not as good as our KNN or neural network. Now let's use our models to predict the average salary of the next two quarterbacks anticipated to get massive contracts, Joe Burrow and Justin Herbert.

Based on real-life intuition, these predictions are a bit on the low side, possibly because we could not account for salary inflation without limiting the dataset too much.

After exploring and analyzing the NFL player stats data, we observed several interesting findings. We found that certain features, like yards, completions and dropbacks, were highly correlated with each other and not all needed in a predictive model. Additionally, we were able to use machine learning techniques such as neural networks and K-nearest neighbors to predict player salary from performance data.

This project goes through the data science lifecycle. We started with data collection and processing, then moved on to exploration to better understand the data and identify any issues or missing values. We went through feature selection, choosing and transforming relevant features to improve model performance. Next, we used various machine learning algorithms to build models and make salary predictions.

Overall, this project demonstrates the importance of using exploratory data analysis and machine learning techniques to gain insights from complex data. It highlights the iterative nature of the data science lifecycle, where each step informs and improves the subsequent steps, ultimately leading to a better understanding of the data and better models. Our results also support the hypothesis that, given certain performance data, we can predict players' contracts with reasonable accuracy.

In conclusion, our analysis of NFL player performance before and after signing contracts provides valuable insights into the justification of player salaries and performance outcomes. By examining a range of features and utilizing machine learning techniques, we gained a deeper understanding of the relationship between performance and contract values.

Our findings indicate that performance before signing a contract showed a strong correlation with the subsequent salary received by players. This suggests that teams and decision-makers in the NFL consider pre-contract performance as a crucial factor when determining player salaries. Players who demonstrated exceptional performance in these areas were more likely to secure higher-paying contracts.

However, our analysis also revealed that the correlation between performance after signing a contract and salary was not as pronounced. While certain features such as big-time throws and PFF grade showed potential as predictors, the overall relationship between post-contract performance and salary was less significant. This suggests that other factors, such as team dynamics, coaching, and player development, might influence post-contract performance to a greater extent than the initial performance that justified the contract.

It is important to note that our analysis focused solely on statistical correlations and does not capture the full context and complexity of player evaluations in the NFL. Factors such as player injuries, team schemes, and player roles within the team might also play a significant role in post-contract performance.

Therefore, while pre-contract performance seems to be an influential factor in justifying player salaries, post-contract performance cannot be solely determined by pre-contract performance. Additional factors beyond individual player statistics should be considered when evaluating player performance and contract value.

In summary, our analysis provides insights into the relationship between player performance before and after signing contracts in the NFL. It highlights the importance of pre-contract performance in justifying player salaries but suggests that post-contract performance is influenced by various other factors. Further research and consideration of contextual factors are necessary to comprehensively evaluate the value and impact of player contracts in the NFL.

Helpful links:

PFF player grading: https://www.pff.com/grades

PFF passing grades: https://premium.pff.com/nfl/positions/2022/REGPO/passing?position=QB

PFF receiving grades: https://premium.pff.com/nfl/positions/2022/REGPO/receiving?position=QB

PFF rushing grades: https://premium.pff.com/nfl/positions/2022/REGPO/rushing?position=WR,TE,RB

PFF blocking grades: https://premium.pff.com/nfl/positions/2022/REGPO/offense-blocking?position=HB,FB

PFF run defense grades: https://premium.pff.com/nfl/positions/2022/REGPO/defense-run?position=T,G,C,TE,RB

PFF pass rush grades: https://premium.pff.com/nfl/positions/2022/REGPO/defense-pass-rush?position=DI,ED,LB,CB,S

PFF coverage grades: https://premium.pff.com/nfl/positions/2022/REGPO/defense-coverage?position=DI,ED,LB,CB,S

Salary info: https://www.spotrac.com/nfl/contracts//

More NFL stats: https://www.pro-football-reference.com/